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K-5 Summary. We investigate the problem of estimating a smootli invertible transformation / 

.,jj. when observing independent samples Xi, . . . ,X„ ~ P o / where P is a known measure. 

^__i We focus on the two dimensional case where P and / are defined on E^. We present a 

flexible class of smooth invertible transformations in two dimensions with variational equations 

I— I for optimizing over the classes, then study the problem of estimating the transformation / by 

P^ penalized maximum likelihood estimation. We apply our methodology to the case when P o / 

^ has a density with respect to Lebesgue measure on E^ and demonstrate improvements over 

^^ kernel density estimation on three examples. 

a 

c/2 1. Introduction 

I In this paper we investigate the problem of estimating the probabihty distribution of a 

^ random vector X from independent copies, 

in 

CN 

. where P is a known measure on M'' and / is an unknown smooth bijection of R'^. We specif- 

t^ ically focus our attention to the two dimensional case, where the theory of quasiconformal 

maps is at our disposal. The estimate is constructed based on the fact that the distribu- 
tion of X is characterized by f{X) ~ P. Therefore, by attempting to "deform" the data 
Xi, . . . , Xn by a transformation / which satisfies. 



o 

00 
O 

>^ /(Xi),...,/(x„)l''p, 

"^ we get an estimate of the distribution of X. In what follows we iteratively deform the 

data toward the target distribution P using a gradient ascent type algorithm over a class 
of transformations. For the remainder of this paper we reserve the term 'transformation' 
or 'deformation' to refer to a smooth map which is a bijection of M'*, i.e. invertible and 
surjective. 

Deformations have been used in many types of statistical problems. For example, de- 
formations are used in spatial statistics for developing nonstationary random fields (see 
Sampson and Guttorp (1992), Schmidt and O'Hagan (2003), Damian et al. (2001), Perrin 
and Meiring (1999), lovleff and Perrin (2004), Clerc and Mallat (2003), Anderes and Stein 
(2008)). Another example is the use of deformable templates for computer vision and med- 
ical imaging problems (see Younes (1999), Joshi and Miller (2000), Dupuis et al. (1998), 
Bajcsy and Broit (1982), Bajcsy et al. (1983), Bookstein (1989), Amit (1994)). There has 
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also been some work on using transformations to boost the efficiency of kernel density es- 
timates. For example, in the one dimensional setting Ruppert and Cline (1994) estimate a 
transformation of data to a uniform distribution in order to reduce the bias of kernel density 
estimation. In a similar manner, data sharpening techniques, developed by Choi and Hall 
(1999) and Hall and Minnotte (2002) are used for improving linear density estimates and 
are available in multiple dimensions. These techniques perturb the data slightly to produce 
bias reduction. However, the perturbations need not be bijective, and so do not represent 
proper transformations. 

Although deformations provide a powerful modeling tool, there are many challenges with 
their practical implementation. The invertability condition, in particular, is one of the main 
difficulties for constructing flexible classes of transformations and searching within these 
classes. One of the most common techniques for dealing with the invertability condition is 
the use of time varying vector field flows for constructing deformations. These vector fields 
{ut}t>o define deformations {ft}t>o indexed by time i > which can be used for finding a 
minimizer of some data dependent penalty. In this paper we use a different characterization 
of deformations to construct classes of deformations indexed by a parameter vector 6. 
However, we derive a correspondence between our parameterization and a time varying 
vector field characterization of the corresponding deformations, whereby relating the two 
characterizations . 

Our basic approach for estimating / is to maximize a penalized likelihood over classes 
of smooth transformations. Of course, this requires the existence of a density, which will 
be guaranteed by mild smoothness conditions on P and /. In two dimensions, we develop 
flexible classes of smooth invertible transformations generated by basis functions, then use 
the theory of quasiconformal maps to derive variational formulas that relate perturbations 
of a parameterization to perturbations of the transformations. This, along with a formula 
for the rate of change of the likelihood allows us to search the parameter space using 
gradient-based optimization of a penalized likelihood. 

We finish this section with a more detailed overview of the results of this paper which, 
we hope, will help the reader follow the remainder of the paper. In section 3, we construct 
our class of deformations by using a nonlinear transform of the linear span of a set of basis 
functions, the coefficients of which generate the parameterization J^ = {/^ : G O C M.°°}. 
We then use known results from quasiconformal maps, outlined in Section 2, to derive 
variational relationships that relate a perturbation of the parameters 6 to the rate of change 
of the likelihood i{0) (see Sections 3 and 4). In particular, let 9 and dO be two parameter 
vectors such that + edO E O for all sufficiently small e > 0. The results in Section 3.1 
show that 

f+'^'' = ,f + eu^'''''of + o{e) (1) 

as e ^ 0, where u^^'^^ is a vector field which depends on both 6 and dO. In Section 5, 
we present some of the algorithmic details for numerically recovering u^^'^^ from 9 and d9. 
We notice that under mild conditions there exists a partial differential equation which is 
numerically solved to approximate u^"^^ . 

The variational results for /^ are developed primarily to allow one to perform a gradient- 
based optimization of a penalized version of the likelihood oi Xi, . . . , Xn ~ P o /^ . For the 
existence of a likelihood one needs sufficient smoothness conditions to ensure P o /^ has a 
density. To this end, we assume that P has a strictly positive differentiable density with 
respect to Lebesgue measure on K^. By writing the density as expTJ for some differentiable 
function H : M.'^ -^ M. and by assuming / is an orientation preserving C^ diffeomorphism 
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of M^, P o / has density \Jf\ expH o f where \Jf\ is the determinant of the Jacobian of / 
(always positive by the orientation preserving assumption on /). Let £{9) denote the log 

likelihood of the sample Xi . . . , X„ ~ P o /^ so that 

n 

m -^J2^og\J^{Xk)\+ H o f{X,). (2) 

fc=l 

Now define the rate of change of the log likelihood at 9 in the direction d9 by £[9]{d9) := 
lirrig^o — - — — when it exists. In Section 4 we derive the following formula for £ 

n 

l[9]{d9) = ^div««'''«(n) + (««'''«(n), Vff(n)) (3) 

fc=i 

where Yk = f^{Xk), (•, •) is dot product in Euclidean space, m^'^^ is the vector field 
found in (1), and divu'''''^ is the divergence of u^''^^. A heuristic interpretation of (3) 
can be given by viewing the two terms as being competitors for maximizing the rate of 
increasing of the log-likelihood. In particular, let / be a proposed transformation of the 
data so that f{Xi), . . . , f{Xn) are approximately distributed as P. Consider attempting a 
small perturbation by a vector field u, f + eu o / + o(e), that increases the log- likelihood. 
To find a good perturbation one wants to maximize the two terms: div u{f{Xk)) and 
{u{f{Xk)),\7H{f{Xk))), for /c = 1 . . . ,n. Notice that (li\u{f{Xk)) measures local expan- 
sion at f{X}S) so that large values of divu(/(Xfc)) correpond to an expansion the region 
surrounding f{Xk)- In contrast, the term {u{f{Xk)),'^H{f{Xk))) is large when the vector 
field gravitates toward the modes of the density of P. This gives an expansion-contraction 
competition for increasing the rate of change of the log-likelihood and by suitably balancing 
these two competing terms one gets the best rate of increase of the log-likelihood. 

In Section 5 we use equations (1) and (3) to perform a gradient ascent of a penalized 
version of the log likelihood. By truncating the parameter space 0^ :— {9i, . . . ,0^), the 
gradient ascent path {9^}t>o is characterized by 

|<=.v£«)-Ava«) 

where 3 is a regularization penalty, A > is a tuning parameter and 

fiKKeiY 

Vii9^) = ■■ 

where ei, . . . , e^v are the standard basis vectors for M.^ . The numerical details for computing 
V£{9^) and implementing the gradient-based optimization are also given in Section 5. 
Finally, in Section 6, we present a series of simulation experiments. Our estimates, which 
are constructed using a Fourier basis to generate the parameterization /^ , compare favorably 
to a kernel density estimate. 

2. Quasiconformal maps and Variations 

We briefly review some results from the theory of quasiconformal maps. For more a complete 
treatment see Ahlfors (2006), Krushkal' (1979), Lawrynowicz (1983), Lehto and Virtanen 
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(1965) . The main result of quasiconformal theory used in this paper is the characterization of 
quasiconformal maps by another function called a complex dilatation. Complex dilatations 
are important for two reasons. First, they are only required to be measurable and bounded 
above by some fc < 1, which makes it easy to construct classes of smooth transformations. 
Secondly, there is a constructible correspondence between the perturbation of a complex 
dilatation and the perturbation of the corresponding transformation. This is used to derive 
the rate of change of the likelihood (see equation (3) ) as one varies a parameterization of 
smooth transformations. 

For a C^ diffcomorphism / define 

The complex dilatation of /, denoted iif or just /i, is defined as /i :— df/df. The value 
of the complex dilatation at a point z, ^i{z), characterizes the infinitesimal ellipse about 
z, which gets mapped to a infinitesimal circle under the image of /. In particular, the 
eccentricity of the ellipse is given by jzjr\ and the inclination is given by arg(— /i/2). We 

will sometimes use the alternative notation fz, fz for df, df when doing so makes an 
equation more readable. 

By generalizing to weak derivatives for df and df, one can extend the definition of the 
complex dilatation Hf := df/df. A quasiconformal map is defined to be a homeomorphism 
/ with locally square integrable weak derivatives such that ess sup \^f\ < 1. This condition 
on (If is equivalent to requiring that the eccentricities of the local ellipses characterized by 
fj,f be a.e. bounded. We now have the following Theorem found in Ahlfors (2006), page 57. 

Theorem 1. For any measurable ii with ||/i||oo < 1 there exists a unique quasiconformal 
mapping /^ with complex dilatation /i that leaves 0, l,oo fixed. 

We also have the nice fact that all quasiconformal maps with a given dilatation are found 
by post composing the unique map /^ in Theorem 1 with a conformal map. In this way, 
any quasiconformal map / : C ^ C with dilatation /i has the form a/^ + b for a, fe G C The 
usefulness of this characterization is seen in the construction of classes of quasiconformal 
maps in Section 3. We finally mention that if /x is known to be Holder continuous, the map 
/ will be a C^ diffcomorphism. 



2. 1 . The dependence of /^ on ^ 

Besides characterizing quasiconformal maps, the other important property of the complex 
dilatation is that f^ depends differentially on jjl so that a small perturbation 11 + ev leads 
to a vector field perturbation /^ + eu o /^ + o(e) of /^. An integral equation then relates 
V and the vector field u. 

To be more precise suppose {/it}t>o is a class of deformations such that ^t+t — fJ't + 
evt + e7(i, e) where "f{t,e),vt G ^oo, IImJoo < 1, and ||7(t, e)||oo ^ as e ^ 0. Then by 
Theorem 5, page 61 of Ahlfors (2006), 

/^'+^=.r'+ewto/A"+o(e), (4) 

uniformly on compact subsets where, 

UtiC) ■■= -\ ff L^'>^t{z)R{z,Odxdy, (5) 
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for R{z,C) - J%llo ' ^''^ •= {t^wIf} ° (/"")"' and z = X + ty. By Holder's 
inequality the integral indeed exists when L^* i/t G ioo • By properties of the integral (5) 
the vector field Ut satisfies 

dut = Lf^'iyt, utiO) = util) = 0, (6) 

in the distributional sense when L^*f( G Lp for some p > 2 (see Lemma 3, page 53 of 
Ahlfors (2006)). We use (6) to approximate the integral (5) in Section 5. 

Finally, one of the consequences of (4) is that for suSiciently smooth {ut}t>o the deter- 
minant of the Jacobian of Z''* at a; e M^, denoted | J((a;)|, exists and satisfies 

j^\Mx)\^\Mx)\dWutoMx). (7) 

For references see Theorem 3.1.2 of Hille (1969) or Corollary 3.1 of Hartman (2002) (page 
96). It is (7) that gives the rate of change of the likelihood £[9]{d6) in equation (3). 

3. Classes of Quasiconformal maps 

To ensure that P o / is a genuine probability measure, one must require that the C^ dif- 
feomorphism, /, be invertible. It is this nonlinear condition that is the main obstacle for 
constructing a rich class of transformations, {/^ : 9 E Q C M°°}. In this section, to over- 
come this difficulty, we use the fiexibility of complex dilatations /i to construct our class of 
transformations. 

By defining a set of basis functions (p^ : C ^ C for k = 1,2,... we formally construct a 
class of complex dilatations. 



k J 



M:= ln= Y^r— -.Lp^y '^Ck^k,CkeM.} . (8) 



The reason for introducing the transformation ip i—^ j^—r is to remove any restriction, 
besides measurability and boundedness, on the linear span ip = ^^ (pk to define a class 
of quasiconformal maps. Indeed, by Theorem 1, for any complex dilatation /i G M with 
ip = 'Ylik^-k'fk G Loo, there exists a unique quasiconformal map /^ — af^ -\- b, where 
6 — (fli, 02, 6i, 62, ci, C2, . . .), a = ai + ia2 7^ and b — bi + ib2. Now we construct the full 
class 3^ of transformations, 

9:={.f: 0eecM°°}, 

where O is the set of parameter values for which ^^ Ck^Pk G Loo and a ^ 0. 



3. 1. Variational Relationsiiip between and f^ 

One of the main components of the gradient ascent algorithm is the computation of the 
vector field perturbation of /^ that results from a small change of the parameter vector 
+ edO. We start this section by studying how the perturbation 6 + edO propagates to /i, 
then finish with a derivation of the expansion /^ + eu^''^^ ° f^ + o{e). 

Suppose (p^dip G Lao are both members of the linear span of {(/?fc}fcGN- We start by 
showing that perturbing (p by ip + edip results in fj, + ei^ + o(e) where 

2+|y| — </^' .„. 

'^-^^2(rTM?-^^2M(i + M)^ (^) 
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where, for the rest of this paper, the second term of the right hand side of (9) is understood 
to be zero at the points z such that ip{z) — 0. To be more precise, let n^p = j^ .^ , and 
v^{dLp) := V as in (9). Suppose 93 = Y^=\ ^k and d(f = YlV=i '^'^k ^k are complex functions 
in Loo with real coefficients {ck\k>i and {dck\k>i- Then 

Mvp+edip = Mv + e'^v('^'/') + *^7(e) (10) 

where v^p{dLp),^{e) G Lryo and ||7(e)||oo ^ as e ^ 0. To see why, let 'J(z) = yctjt, where 
z = x + iy and notice that d7{z)/dx and dJ{z)/dy are continuous on C and satisfy 



9T(z) 

dx 
d7{z) 

dy 



1 



1 + kl 
i 



y- 



kl(i + kl)^ 

z 



where the last term is when z = 0. Therefore directional derivative i9„T, in the direction 
■u G C, exists and is continuous on C for any fixed u = u\-\- iu^ and is given by 



a„T(z) 



9T(z) 
dx 



V\z\ 



9 T(z) 
dy 

uz + uz 



"2 



kl(l+kl)^ 



2{l + \z\Y 2|z|(l + |z|)^ 



where the last term is when z = 0. Therefore the map T has a Frechet derivative 
'J^('u) := du7{z) and by the mean value theorem (see Dieudonnc (I960)) 



T(^(z) + ed^{z)) - T(^(z)) 



--^'M^^) 



< \dip{z)\ sup 



7' - 7' 



•p{z) 



for a fixed z G C, where 5* is the line connecting ^{z) + ed(p{z) and (/^(z), and || • || is the 
standard operator norm on the space of continuous linear mappings from 



to 



Now 



sup 



T T' 



< sup 



^? ~ "^U^) 



where ||(i(/3||oo = B. The last term converges to zero uniformly in z G C as e 
is bounded and T' is uniformly continuous on compact domains. Therefore 

T(^ + ed^) = 7{^) + e7'{d^) + e^{e) 



since <f{z) 



where |l7(e)|loo ^ as e ^ 0, which gives (10). Finally notice |:/y((i<^)| = |T' ((i(/3)| < 

||d(/3||oo/(l+|</3|) < WdifiW^ <00. 

Now we can combine all the perturbation results to derive variational relationship be- 
tween the coefficients and the maps. Suppose = {9i,92,---) and dO = {dOi^dO^, ■ ■ ■) 
are two parameter vectors such that 9 + edO G O for all sufficiently small e > 0. Let 
V = Z]fcl4 Sk^k, dip = Y^'k=A d^^kfk, ^^ = T+M^ ^^^ ^ defined by (9). Also let da = ddi+id92 
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and similarly db = dO^ + id94. The following display summarizes how the perturbation of 6 
by edO propagates through /^: 



e + ede< 



ip + edip 
a + eda 
b + edb 



by (10) 



^ + eiy + o(e) 
a + eda 
b + edb 



by (4) 



/^ + ew o /^ + o(e) 
a + eda 
b+edb 



^here uo f^{0 ^ -^ JJ L'^i^{z)Riz, ,riC))dxdy i,nd L^i. = {j^^} o {p)-\ To get 
the perturbations in the form f^+^d.^ = f^ + eu^'"^^ o f^ + o{e) we simplify equation 

^f + e{au o /^ + da/^ + d6) + o(e). 



Notice 

"°,r(c) 



1 

TT 

1 

TT 

1 

aTT 






(fe-/"(c))(a+fc-r(c)) 
(/^w-r(c))(&-rw)(«+&-/^w) 



(9r(z))^ 



dxdy 



L'^i^)}'-Ly^}r'-^y\<^^^y^ 



where L^i 



where 



{z-f^{0){b-z){a + b-z) 
< ^_'f |2 == > o (f^)^^. Therefore, finally, we have 



<,d9 



iC) = db+-iC-b) 



L%{z) 



{b-OJa + b-O 
iz-C)ib-z)ia + b-z) 



dxdy. 



If we have the nice situation that L^v ^ Lp for some p > 2 it can be shown that dt 
in the distributional sense where u^''^^{b) = db and u^''^^{a + b) = da + db. 



<de 



(11) 



(12) 



= L«j. 



4. Rate of Change of the Likelihood 

We now return to the observation scenario where we have iid observations Xi , . . . , Xn from 
the measure P o /^ where P is a known measure on M^ and /^ is known only to be a 
member of the class of invertible transformations 5" given by (3). We also suppose that P 
has a strictly positive differentiable density with respect to Lebesgue measure on M^ and 
write the density exp H for some differentiable function iJ : M^ ^ M. Since H is assumed to 
be known, the likelihood of the observations Xi, . . . , Xn, as a function of e O, is given by 
(2). Let I J^+^''^(a;)| denote the determinant of the Jacobian of the map f^+'^d.^ evaluated 
at X e M^. Notice that equations (11) and (7) give 



d 

de 
d 

de 



\og\j^+'''\x)\ -divu«''^^o/«(a;) 

H o /«+^'^«(x)l = (^i^^''^ o f{x),\/H o f{x)) 



(13) 
(14) 
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for all X e M^ where u^'^^ is given by equation (12) and l^-, •) denotes dot product in 
Therefore 



e=0 



d~e 



e=0 



i[e]{de) := ^\i{e + ede) 

n 

J2 log iJ^+'^'^iXk)] + H o f+'''%Xk) 

= ^divw«''^^(rfe) + (t.«''^^(rfc), vi/(n)), 
fe=i 

where Yk = f^{Xk) and u^'^^ is given by (12). Notice that this gives formula (3) given in 
the introduction. Finally we mention that £[0](-) is linear over the reals so that 

i[e]{de) = Y,d9km{ek) 

k 

where 8 = ^j, OkCk and d9 = ^j, dO^ek where where e^ = (0, . . . 0, 1,0,.. .) with the non- 
zero element at the fc"^ index. 

5. Algorithmic details 

In this section we discuss the algorithmic details for a gradient ascent of the likelihood 
i{0). In our implementation of the algorithm, we slightly adjusted these techniques by 
using quasi-Newton updates instead of gradient ascent updates. However, the essential 
components of the algorithm are still captured by gradient ascent while allowing succinct 
exposition. Therefore, we reserve the details of the quasi-Newton updates for the next 
section. 

The gradient ascent will be done by truncating the parameter space 0^ := (^i, . . . ^9n) 
and using (3) to write, 

/i[0^](ei)\ „ /divui(n) + (ui(n),vi7(yfe))\ 

v^(0= ; =E ' (15) 

V40^](ew)/ '^^i VdivwAr(n) + {uN{Yk).VH{Yk))l 

where Ui := u^ '"^^ , . . . , mjv = u^ ''^" and Y^ ~ f{Xk). Gradient ascent is then written as 
a solution to, 

l^f = V^(0f)-AVa(0f), (16) 

where 3 is a regularization penalty. The most difficult part of of this algorithm is solving 
Ui, . . . ,un at each step in a discretization of (16). We start this section with a detailed 
discussion of this problem and finish with a few comments on the recursive nature of the 
gradient ascent. 

To find the vector fields ui, . . . , un in (15) one needs compute the singular integral (12). 
When ly has compact support, the integral (12) has the form (PL^i/)(C) -I- (linear function of () 
where Ph{C,) := —- JJ h{z)/{z — ()dxdy is the Cauchy Transform. After a simple scal- 
ing, one has available the computational techniques found in Daripa and Mashat (1999) 
for computing the Cauchy Transform over the unit disk. In particular, suppose ft, is a 
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function with compact support in fl and let c be a constant such that c > \z\ for all 
z G i^. Then hc{z) := h{cz) has support in the unit disk D and Ph{c() = cPhdC) — 
— ^ //d hc{z)/{z — Qdxdy for all C S D. Now one can use Diripa and Mashat's method to 
approximate JL hc{z)/{z — C,)dxdy. Finally one needs to find the linear correction to PL^v 
for the integral (12). This is accomplished by the two necessary conditions u^''^^{a) = da 
and u^''^^{a + b) = da + db. 

A second approach for approximating u^''^^ is to solve the nonhomogeneous Cauchy- 
Riemann equations du ' = L v. The solution is only unique up to additive analytic 
functions, however, if it is known that L^v S Cq then PL^v E C^ and dPL^v G L2 (see 
Lemma 2, page 52 of Ahlfors (2006)). Therefore any continuous solution to du^''^^ ~ L^v 
which satisfies du^'"^^ G L2 is an additive constant difference of PL^v. Therefore one can 
find u^^'^^ by solving the nonhomogeneous Cauchy-Riemann equation du^''^^ = L^iy' with 
free boundary conditions on dil, where il is the support of L^u. 

In our implementation of the algorithm the Diripa and Mashat's method suffered from 
some instabilities. We instead used an ad-hoc method for trying to invert the equation 
Qy^de _ j^e^ j^y j-epresenting u^^'^^ by a latent complex function w such that u = dw (this is 
basically a stream function and potential function representation of a vector field with the 
coordinates switched) then using the fact that Add — A, the Lapacian operator, we used 
a fast Poisson solver in Matlab to invert Aw — AL^v. One drawback to this technique 
is that it potentially differs from PL^v by an additive analytic function. In the cases we 
studied the magnitude of the analytic difference was small enough to be ignored. However, 
we would prefer Daripa and Mashat's method but leave it's proper implementation to more 
adept numerical programmers. 

One particularly attractive feature of the algorithm is its recursive nature. Suppose at 
step 771 of a discretized gradient ascent one has /^^ and /f "• evaluated on a dense grid. We 
show how to find /''™+i and /^""^^ at step m + 1. To makes the following formulas more 
readable let jjP'" — ^_m' 7^ , be the complex dilatation of /^", where (/?^™ = X]fc>4(^rra)fc'/'fc 
is given by the basis expansion in (8). The gradient update is Om+i — dm + ecf^m where 

dem-(3[0™](ei),a[0™](e2),...). 

Remember that for each j > 4, 

n 

a[0,„](e,) = ;^div«,(rfe) + {uj{Yk),VH{Yk)), 
fe=i 

where, Y^ — f^"^{Xk), duj = L^^^Vj and Vj is given by 



Notice that computing L "^Vj can be accomplished by simply deforming the graph of 
i-\'im\2 %^ by /^™. Now /^'"+i := /^^ + ^^e^Ae^ ^ je^ ^^^ using the composition 

rule for d and fact that du^™^'^^™ — L^^v we get the following update for fz"^^^ 



where v = J2k>M^rn)kVk 
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Fig. 1. Panel of Density Estimates: The true sampling density is shown in the central column; in 
the left and right columns are density estimates based on the same sample of 10,000 points using 
respectively the deformation-based methods of this paper and kernel density estimates. There is one 
row per example: halfg, stroke, waffle. As indicated by the colorbar the scale for the heatmaps is non- 
linear to accentuate differences in low-density regions (a square-root transform is used, consistent 
with Hellinger distance). A solid line is drawn at the contours of the density that capture 95% of the 
mass; the 50% line is dashed, and the 25% line is dash-dotted. 
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Table 1. Table of the average integrated squared errors, sum- 
marized across six independent realizations. Averages (and 
parenthetical standard deviations) are reported times 10* 

Sample Size 
Case Method 10,000 1,000 100 



Halfg 


deformation 


59 (4) 


109 (19) 


325 (83) 




kernel 


190 (8) 


409 (34) 


880 (181) 


Stroke 


deformation 


24 (8) 


122 (30) 


909 (209) 




kernel 


157 (18) 


642 (81) 


2354 (454) 


Waffle 


deformation 


64 (5) 


222 (30) 


674 (30) 




kernel 


74 (5) 


259 (26) 


662 (49) 



6. Simulation Experiments 

To investigate the quality of the density estimates obtained by the deformation method 
we performed a series of simulation experiments. We first give an overview of the results; 
details are explained afterward. Data sets of size 100, 1000, and 10,000 were generated 
independently from three distributions that we call halfg, stroke, and waffle. Each example 
is intended to demonstrate different properties of our estimator; the examples, and the 
results are illustrated in Figure 1. 

The halfg example was chosen to highlight the ability of the deformation method to 
1) adapt to sharp edges of the density 2) utilize qualitative knowledge about the form of 
the density, namely through the use of a softened half-normal as the target density. The 
stroke example, in which a normal density is stretched out into an "S" with the thickness 
varying along the length, is intended to test the ability of the deformation to capture 
the overall structure well enough to extend mass into the tails appropriately. The waffle 
example tests the estimates performance in a complex, bumpy case. Both the stroke and 
the waffle examples use a radially symmetric bivariate Gaussian distribution with mean 
as the target. For stroke, the target has standard deviation 0.05; for waffle, the target 
has standard deviation 0.5. From these choices we find that the method can successfully 
reconstruct the long thin stroke density by "stretching" a small target, and can also match 
the complex waffle pattern onto a moderate target. 

For each data set, the optimal penalty parameter A was chosen to minimize the integrated 
square error (ISE) difference between the density estimated for that A and the true sampling 
density. The minimal ISE was recorded and compared with the minimum ISE achieved by 
kernel density estimation at the optimal bandwidth. To determine the sampling variability 
of the findings, this was repeated six times for each example. These results, as well as results 
for Hellinger and L^ distance measures, are tabulated in Tables 1, 2, and 3. Examples of 
the estimates achieved on data sets of 10,000 are shown in Figure 1. 



6. 1. Sampling Densities 

In this subsection we give the detailed recipes for the sampling distributions and target 
densities used in the examples. 

Halfg: Let x — (xi,X2)*, let ^(x) — {xi + h{x2),X2Y, where /i(w) = sin(5u)/15 — 
■utanh(M)/3. The data were generated by sampling from a symmetric bivariate normal with 
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Table 2. Table of the average Hellinger-distance errors, summa- 
rized across six independent realizations. Averages (and paren- 
thetical standard deviations) are reported times 10* 

Sample Size 
Case Method 10,000 1,000 100 



Halfg 


deformation 


863 (17) 


1160 (43) 


1915 (250) 




kernel 


1881 (19) 


2718 (54) 


3860 (236) 


Stroke 


deformation 


443 (33) 


992 (110) 


2601 (187) 




kernel 


1083 (32) 


2048 (53) 


3800 (285) 


Waffle 


deformation 


1021 (31) 


1826 (68) 


3002 (116) 




kernel 


1193 (24) 


2105 (66) 


3149 (78) 



Table 3. Table of the average L1 -distance errors, summarized 
across six independent realizations. Averages (and parenthetical 
standard deviations) are reported times 10* 

Sample Size 
Case Method 10,000 1,000 100 



Halfg 


deformation 


544 (46) 


955 (93) 


1808 (250) 




kernel 


1385 (35) 


2382 (115) 


3980 (475) 


Stroke 


deformation 


487 (71) 


1124 (120) 


3104 (259) 




kernel 


1268 (53) 


2572 (124) 


5018 (442) 



Waffle deformation 1535 (48) 2813 (136) 4814 (139) 
kernel 1726 (38) 3153 (113) 4969 (117) 
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standard deviation 1/2, rejecting points in the right half plane, and applying the transfor- 
mation ^ to the remaining points. In effect, then, the density of the observed points is a 
deformed half-normal. Notice that the indicator introduces a smoothly curved sharp edge 
to the density: 

2(1/a/2^)2(1/2)2 exp(-l/8((a;i - h{x2)f + xl)l,^_^,^^<„ 

For the halfg example, instead of using a standard Gaussian as our target we used a 
softened half-normal; the softening was done so that the target would be a continuously 
differentiable density to facilitate optimization, yet still describe a half-normal-like distribu- 
tion. Specifically, the target distribution is described by independently drawing X2 from a 
normal with mean and standard deviation ^ , and Xi from the mixture of half-normals with 
density 7[2(/)^^(x)I^>o] + (l-7)[2<^<T_(a;)Ia:<o], where cr_ = i, cr+ = ^, 7= (l-hcr-/cr+)"\ 
and (/ict is the density of a normal with mean and standard deviation a. 

Stroke: Let ^(a;) = (a;i/2, ^(3 -I- 2tanh(a;i))a;2 — |sin(xi)) . The samples are generated 
by sampling X = (Xi,X2)* as a pair of independent standard normals, and applying 
transformation ^P. The target density was taken to be a mean symmetric bivariate normal 
with standard-deviation 0.05. 

Waffle: Let ^(x) = | {xi + sin(27ra:;i)/(27rs), 2:2 + cos(27rx2)/(27rs) + {xi/iY)\ where we 
take s — ^. The samples are generated by sampling X = {Xi, X2Y as a pair of independent 
standard normals, and applying transformation ^. The target density was taken to be a 
mean symmetric bivariate normal with standard-deviation 0.5. 



6.2. Basis Functions ipk and Penalty 3{0) ■ 

Here we define the basis functions (pk, in (8), and the regularization penalty 3, in (16), 
used for our simulations. To motivate our choice of basis functions (fk, we note the Fourier 
representation of L2 functions on [— L,i]^: 



ip{x + iy)= ^ {akiM+'^^kiMJe 



i2iT(kix+k2y)/2L 



fci.fcaGZ 

fci,fe2ez ki,k2ez 

for real coefficients akiMi^kiM- These Fourier basis functions are smoothly truncated 
to zero outside of the disk of radius L by multiplying each basis function by T(x,y) = 
[1 — ((x/L)^ -|- (y/L)^)'*]l/x2+y2^^2i (x, y) where 'i-ix^+y^-cL^i is the indicator of the disk of 
radius L. The infinite series is then truncated to a finite expansion to get the following 
(2N + 1)^ basis elements used in our simulations to define {^k}- 

N 

kiM=-N 
The regularization penalty 3 in (16) is defined by, 

ki,k2 
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where 9 is the parameter vector composed of the basis coefficients afci,fc2, hki,k2 ^^^id the 
scale terms ai, 02, 6i, 62 defined in Section (8). Our choice of the penalty is motivated by the 
Fourier representation of the spline penalty /r r ri2 \^v{x,y)\'^dxdy ex X^fcez^ l'^l*l'?(^)P- 
All of our simulations recorded in Tables 1, 2, and 3 used A^ = 6 except that for the 
datasets of size 10,000 we used A^ = 10 for the waffle and A^ = 8 for the stroke in the 
hope that this would allow us to capture finer detail. Generally the results do not appear 
to be sensitive to the choice of A^. For example, some of the waffle/stroke examples were 
also attempted with N — Q\ aside from one exceptional case, the ISE's obtained differ by 
no more than 4 x 10~^ from the values reported in the table. In the exceptional case, 
the optimizer failed to improve upon the initial conditions; upon restarting with slightly 
different initial values it appeared to behave normally. The small dependence on A^ can be 
attributed to the high coefficients of the penalty on high frequency terms. 

6.3. Optimization Details 

For each data set, a grid of values of A was used, namely: 10^-^, 10^-^^, 10"^, . . . , 10"^'^^, 10~^. 
The optimization was performed for each A in turn, from the largest to the smallest, so that 
the parameter estimates and estimates of the local Hessian of the objective (as produced 
by the optimizer) from the more highly regularized cases could serve as initial values for 
the less regularized problems. 

The actual optimization was performed by an adaptation of the BFGS algorithm (see 
Lange (2004)). The coefficients of the basis for the dilatation were broken into real and 
complex parts and the gradient was expressed as a real-valued vector with respect to this 
partition. The Hessian estimate was initialized as a diagonal matrix with 1000 on the 
diagonal. All steps proposed by BFGS were restricted to have a maximum length of 0.02. 
Very long steps are to be avoided because they may introduce numerical error. After every 
successful step, the Hessian-estimate was updated in the customary BFGS manner. 

If a proposed step is too long, the step is shrunk using the trust-region technique (see 
Lange (2004)); i.e. essentially by temporarily increasing the diagonal of the local Hessian in 
the Newton-type step. The end result of this technique is that each step taken is optimal, 
according to the current quadratic approximation to the objective function, subject to the 
constraint that it lie in a ball of the given radius. Similarly, if the step turned out not to 
reduce the objective, the step was retracted and then the target step length was halved. 
If ten successive halvings fail to result in a downward step, the failure was noted and the 
current parameter estimates were stored and the associated deformation was recomputed 
from scratch. This rccomputation also occurs on every tenth step to reduce the accumulation 
of numerical errors. If descent fails again at this point, the optimization is terminated. 

7. Discussion 

The simulation results displayed in Tables 1, 2, and 3, as well as Figure 1 illustrate the 
potential of using transformations for density estimation. We believe that by penalizing 
the transformation of the distribution to be smooth, rather than the directly forcing the 
density estimate to be smooth, we introduce an interesting new regularization strategy. In 
the figure, we see that this approach produces quaHtatively pleasingly smooth estimates of 
the tails. In the case of the halfg example, we see that it successfully incorporated the prior 
information that a smooth edge is present, resulting in an efficient estimate of the location 
and shape of this curve. 
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The quantitative improvements is most dramatic for the stroke example in which the 
relative improvement in mean integrated squared error is respectively 85%, 81%, and 61% 
for sample sizes of 10, 000, 1, 000, and 100. These results sound only slightly less impressive 
if reported in terms of the L^-norm, instead of the ISE (its square), namely improvements of 
61%, 56%, and 38% respectively. These percentages roughly agree with the %-improvements 
seen in terms of the Hellinger and L^ metrics. The results for the halfg example are also 
good, with improvements ranging from 39% to 69% across all sample sizes and criteria. For 
example, even at the sample size of 100, use of the prior information that the sampling den- 
sity is approximately a perturbed half-normal, results in an improvement of 55% over kernel 
density estimation in terms of mean Hellinger distance. The results are least impressive for 
the waffle example, whose complex, bumpy structure is well suited to the kernel-estimate. 
The relative improvements range from 7% to 14% (depending on the metric used) at sam- 
ples sizes of 10, 000 or 1, 000. At the smallest sample size of 100, the mean ISE is worsened 
by 2% (not significant by nominal .05-level t-test), although on these same six realizations, 
there is a (nominally significant) 3% to 5% improvement in the L^ and Hellinger metrics. 

The contributions of this paper include the development of a flexible class of two di- 
mensional transformations and a derivation of the rate of change of the log-likelihood as 
it depends on the unknown transformation. These variational results are then used to 
construct a gradient-based algorithm for estimating the transformation that maximizes a 
penalized log-likelihood. In Section 5 we discuss how to approximate the vector fleld per- 
turbation of /^ by solving a differential equation of the form du^''^^ = L^v. Our results are 
in two dimensions, however, it is our hope that using transformations for density estimation 
can ultimately be useful in high dimensional density estimation by smoothly interpolating 
density in sparse areas while still preserving 'sharp' features of the true density. We consider 
this a first step toward both numerical and theoretical exploration of using invertible maps 
in non-parametric and semi-parametric density estimation in general dimensions. 
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